11/21/2020

Original Premise

  • We originally wanted to set up a scouting report for every batter in the league, using some custom statistics derived from advanced batting statistics. We would then produce a heatmap and detailed scouting report showing which pitches work against each batter and where in the zone they work. Gleefully, we set about finding our data.
  • Our data exists, but isn’t cheap. Back to the drawing board.
  • Instead, we decided to see if we could predict the net number of wins a pitcher was worth based on his ERA (earned run average), OBP (opponent on-base percentage), and whatever else we could think of. We found a package called Lahman that would do the job.

Beginning Research

  • Here, we’re going to look at how we selected and in some cases created the column data we needed.

  • Select columns from the Pitching dataset
    playerID, yearID, teamID, IPouts, BB, SO, BAOpp, ERA, W, L

  • Create a Net Wins column
    pitchers <- pitchers %>% mutate(NetWins = W-L)

  • Only keep rows where there is no missing data
    pitchers <- pitchers[complete.cases(pitchers),]
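The three steps above can be chained into one short script. This is a minimal sketch in base R, assuming the `Pitching` table from the Lahman package; the small fallback data frame is made up (not real players) and exists only so the snippet runs when Lahman is not installed.

```r
# Columns the slides keep from the Pitching table.
keep <- c("playerID", "yearID", "teamID", "IPouts", "BB", "SO",
          "BAOpp", "ERA", "W", "L")

if (requireNamespace("Lahman", quietly = TRUE)) {
  raw <- Lahman::Pitching
} else {
  # Hypothetical stand-in rows, used only when Lahman is absent.
  raw <- data.frame(
    playerID = c("aaa01", "bbb01", "ccc01"),
    yearID   = c(2001L, 2001L, 2002L),
    teamID   = c("AAA", "BBB", "CCC"),
    IPouts   = c(600L, 450L, 30L),
    BB       = c(50L, 40L, 5L),
    SO       = c(180L, 120L, 8L),
    BAOpp    = c(0.230, 0.260, NA),
    ERA      = c(3.10, 4.25, 6.30),
    W        = c(15L, 9L, 0L),
    L        = c(8L, 11L, 1L)
  )
}

pitchers <- raw[, keep]                            # select columns
pitchers$NetWins <- pitchers$W - pitchers$L        # net wins
pitchers <- pitchers[complete.cases(pitchers), ]   # drop rows with any NA
```

Note that `complete.cases()` drops a row if any kept column is missing, so it is worth checking how many rows survive before modeling.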

Treat the Data

  • Normalize data so that coefficients are meaningful

# z-score each predictor so the coefficients are on a comparable scale
pitchers <- pitchers %>%
  mutate(normIPouts = (IPouts - mean(IPouts)) / sd(IPouts),
         normBB     = (BB - mean(BB)) / sd(BB),
         normSO     = (SO - mean(SO)) / sd(SO),
         normBAOpp  = (BAOpp - mean(BAOpp)) / sd(BAOpp),
         normERA    = (ERA - mean(ERA)) / sd(ERA))
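The normalization used here is a z-score: subtract the column mean and divide by the column standard deviation. As a small sketch on a made-up strikeout vector (the numbers are illustrative, not taken from the dataset), the manual formula matches base R's `scale()`:

```r
# Hypothetical toy column; any numeric vector works.
so <- c(127, 269, 180, 95, 210)

z_manual <- (so - mean(so)) / sd(so)   # formula used on the slide
z_scale  <- as.numeric(scale(so))      # base R equivalent

# A z-scored column has mean 0 and standard deviation 1.
c(mean = mean(z_manual), sd = sd(z_manual))
```

Standardizing the predictors does not change a linear model's R-squared. It only rescales each coefficient to "change in NetWins per one standard deviation of the predictor", which is what makes the coefficients comparable.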

Building linear model

  • Finally, we can build our linear model and check its R-squared.

rsquared:

## [1] 0.1389159
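The slide reports only the R-squared, not the model call. A plausible sketch, assuming the same five standardized predictors used in the later models, looks like the following. The name `mylm` and the simulated `toy` data are assumptions for illustration, so the value printed here will not match the 0.1389 above.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the standardized pitcher table (not Lahman data).
toy <- data.frame(
  normIPouts = rnorm(n), normBB = rnorm(n), normSO = rnorm(n),
  normBAOpp  = rnorm(n), normERA = rnorm(n)
)
toy$NetWins <- 1.5 * toy$normIPouts - 1.0 * toy$normERA + rnorm(n, sd = 4)

# Ordinary least squares on the five standardized predictors.
mylm <- lm(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA,
           data = toy)

# Share of the variance in NetWins explained by the model.
rsquared <- summary(mylm)$r.squared
rsquared
```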

Building a Generalized Additive Model

  • That last model sucked. Let’s try again with a better model.

mygam <- gam(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA, data = pitchers)

rsquared:

## NULL

Building a Generalized Linear Model

  • That last model sucked too. Let’s try another one: a generalized linear model.

myglm <- glm(NetWins ~ normIPouts + normBB + normSO + normBAOpp + normERA, family = gaussian, data = pitchers)

rsquared

## NULL
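A note on the `NULL` results: `summary()` of a `glm` fit has no `r.squared` element, so asking for one returns `NULL` and says nothing about fit quality. A Gaussian GLM with the default identity link is also the same least-squares fit as `lm()`, so its R-squared can be recovered from the deviances. The sketch below uses simulated data, and the names `fit_lm`, `fit_glm`, and `toy` are made up for illustration.

```r
set.seed(7)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n)

fit_lm  <- lm(y ~ x1 + x2, data = toy)
fit_glm <- glm(y ~ x1 + x2, family = gaussian, data = toy)

# summary.glm() has no r.squared element, so `$r.squared` is NULL.
is.null(summary(fit_glm)$r.squared)

# Deviance-based R-squared: 1 - residual deviance / null deviance.
r2_glm <- 1 - fit_glm$deviance / fit_glm$null.deviance
r2_lm  <- summary(fit_lm)$r.squared
all.equal(r2_glm, r2_lm)
```

The GAM call has a similar issue: its formula contains no smooth terms such as `s(normERA)`, so it fits the same linear relationship. How its summary exposes R-squared depends on which package supplied `gam()`; with mgcv, for example, the adjusted value is stored as `r.sq` rather than `r.squared`.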

Building models

  • Wow, that went well. Let’s see if we can figure out any kind of a model that works at all, even ones that are nonlinear and completely unintuitive to normal humans.

  • We’re going to throw everything at it. We are inevitable.

  • NetWins ~ IPouts * HR * BB * SO * BAOpp * ERA * IBB * WP * HBP * BK * BFP * GIDP

  • Put on that infinity glove and snap your fingers

## [1] 0.5431516
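One caution about this number: in an R formula, `*` expands to every main effect plus every interaction, so twelve variables produce 2^12 = 4096 coefficients (including the intercept). Plain R-squared can only go up as terms are added, even when the terms are noise. The sketch below shows this on simulated data with six pure-noise predictors; nothing in it comes from the pitching data.

```r
set.seed(1)
n <- 200

# Six predictors of pure noise and an unrelated response (simulated).
noise <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(noise) <- paste0("x", 1:6)
noise$y <- rnorm(n)

# `*` gives all main effects plus every interaction: 2^6 = 64 coefficients.
fit <- lm(y ~ x1 * x2 * x3 * x4 * x5 * x6, data = noise)
n_coef <- length(coef(fit))

r2     <- summary(fit)$r.squared       # inflated by the number of terms
adj_r2 <- summary(fit)$adj.r.squared   # penalized for the number of terms
c(coefficients = n_coef, r.squared = r2, adj.r.squared = adj_r2)

# The 12-variable formula on the slide expands to this many coefficients.
2^12
```

Adjusted R-squared, or R-squared measured on held-out seasons, would be a fairer yardstick for the everything-model than the raw value.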

After the MCU

  • Well, look at that: we snapped our fingers and explained just over half of the variation. Excellent work, team!

  • Just kidding, we need to be able to do a lot better than this.

  • Clearly, this isn’t going as well as it might’ve. Let’s make some graphs and see what we can find before we build another model.

Crossplot

Interesting Things

To me, the most interesting thing was strikeouts by OBP, so I made an interactive graph.

Interactive OBP x Strikeouts

OBP x Walks

Strikeouts x Walks

Advanced pitching metrics

  • We’re going to look at projecting OBP and Wins using some advanced metrics.

# Build a full-name column, then list the pitchers alphabetically
advancedPitchers <- advancedPitchers %>%
  mutate(name = paste(first_name, last_name))

advancedPitchers[order(advancedPitchers$name), ]

Fastball Speed

Hard Hit Percent

Exit Velocity

Launch Angle

Barrel Batted Rate

Spin Speed on Breaking Ball

Sweet Spot Percent

New Model

  • We did find some statistics that had positive trends. Are they significant enough to be useful? Let’s find out!

  • on_base_percent ~ hard_hit_percent * sweet_spot_percent * exit_velocity_avg * year

  • Rsquared?

## [1] 0.1873352

So that sucked

  • Let’s try another model

  • Throw everything on the pitching side at it.

  • on_base_percent ~ fastball_avg_speed * breaking_avg_speed * offspeed_avg_speed * fastball_avg_spin * breaking_avg_spin * offspeed_avg_spin

  • Rsquared?

## [1] 0.3821325

So that still sucked

  • Let’s try again. Throw everything on the hitting side.

  • on_base_percent ~ exit_velocity_avg * launch_angle_avg * hard_hit_percent * sweet_spot_percent * barrel_batted_rate * solidcontact_percent

  • Rsquared?

## [1] 0.3544678

Baseball is hard

Why are we having so much trouble?

Concluding graph

What did we find out?

Pitching success in baseball is very hard to project. To support this argument, compare Jacob deGrom’s Cy Young-winning 2018 season with Bob Welch’s Cy Young-winning 1990 season.

Examples of two men: Bob Welch

##   Year  ERA   OBP  W L  IP BB  SO       Awards
## 1 1990 2.95 0.302 27 6 238 77 127 ASCYA-1MVP-9

Examples of two men: Jacob deGrom

##   Year ERA   OBP  W L  IP BB  SO       Awards
## 1 2018 1.7 0.243 10 9 217 46 269 ASCYA-1MVP-5
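Putting the two stat lines side by side makes the mismatch concrete. The sketch below copies the values from the two tables above and derives net wins and strikeouts per walk; the data frame name `cy` and the `SO_per_BB` column are ours, added for illustration.

```r
# Values copied from the two tables above.
cy <- data.frame(
  pitcher = c("Bob Welch", "Jacob deGrom"),
  Year    = c(1990, 2018),
  ERA     = c(2.95, 1.70),
  W       = c(27, 10),
  L       = c(6, 9),
  BB      = c(77, 46),
  SO      = c(127, 269)
)

cy$NetWins   <- cy$W - cy$L              # 21 for Welch, 1 for deGrom
cy$SO_per_BB <- round(cy$SO / cy$BB, 2)  # 1.65 for Welch, 5.85 for deGrom

cy[, c("pitcher", "Year", "ERA", "NetWins", "SO_per_BB")]
```

deGrom is better on every rate statistic here, yet Welch has 20 more net wins, which is exactly the kind of gap our models could not explain from pitcher statistics alone.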

Where can we go from here?

After deliberating and talking more about why we couldn’t find anything useful in our research, we concluded that too many outside variables factor into a pitcher’s success. Team performance is the most important one we could think of. deGrom’s Mets finished with 77 wins and were near the bottom of MLB in many major offensive categories. Welch’s A’s were the best team in baseball, finished with 103 wins, and were at the top of many offensive categories. One potential research topic is a new statistic, an adjusted win statistic, that measures a pitcher’s worth as if he were backed by a league-average offense and defense. This is just one thing that could be explored in the future!

Thank You!